Abstract

This paper presents results for the Japanese/English
cross-language information retrieval task on the
NACSIS Test Collection. Two automatic dictionary-based
query translation techniques were tried with four
variants of the queries. The results indicate that
longer queries outperform the required description-only
queries and that use of the first translation in the
edict dictionary is comparable with the use of every
translation. Japanese term segmentation posed no
unusual problems, which contrasts sharply with results
previously obtained for cross-language retrieval
between Chinese and English.


1  Introduction 

Cross-language information retrieval (CLIR) deals
with the problem of retrieving information in
languages different from that of the query [8]. Several
effective CLIR approaches are now known, but none
have yet been tested on large-scale collections that
include Asian languages. Several Asian languages
lack explicit word boundaries in their written form,
and this poses a challenge for CLIR systems about
which little is presently understood. We recently ran
an experiment using Chinese queries to retrieve
English documents from the Text REtrieval Conference
(TREC) in order to begin to address this issue [9].
In that work we found that segmentation errors
produced a cascading effect through translation that
ultimately produced inappropriate term weights, thus
depressing retrieval effectiveness. In the NACSIS
Test Collection Information Retrieval (NTCIR)
experiments reported in this paper we applied the same
experiment design to Japanese/English retrieval to
explore whether the problem is present to the same
degree in this case.


2  Background 

There  are  four  fundamental  ways  to  match  queries  in 
one  language  with  documents  in  another: 

•  Cross-language matching. Leave the queries
and the documents untranslated and embed
translation knowledge in the matching algorithm
(e.g., [3]).

•  Query translation. Translate the query into
the documents’ language(s) and then perform
monolingual retrieval (e.g., [1]).

•  Document translation. Translate the documents
into the supported query language(s) and
then perform monolingual retrieval (e.g., [7]).

•  Interlingual matching. Translate both the
queries and the documents into a language-neutral
representation and use those representations
as a basis for retrieval (e.g., [5]).

In cross-language retrieval between European
languages, query translation has proven to be popular
because it is efficient (for relatively short queries),
and because the common character set sometimes
results in helpful cross-language exact string matches
when no translation is known for a word (as is
commonly the case with proper names, for example).
Dictionary-based query translation (term-by-term
translation using a term list built from a bilingual
dictionary) is easily implemented, and is well
known to produce about half the retrieval
effectiveness (e.g., average precision) of monolingual
systems. Since our primary goal is to understand the
additional challenges posed by Asian languages, we
elected to use dictionary-based query translation
(referred to below as DQT) for our experiments.

Figure 1 illustrates the key differences between
cross-language retrieval using DQT and the monolingual
case. Queries enter from the left, and in what are
called “bag-of-words” retrieval systems (i.e., those
that do not preserve word order information), the
first step in both cases is to select terms. In
European languages this can involve tokenization on
white space, phrase recognition, and (for languages
such as German) compound splitting. For Asian
languages, the corresponding step is segmentation.

Figure 1: Comparison between cross-language (top)
and monolingual (bottom) retrieval

Although both cross-language and monolingual
bag-of-words retrieval systems perform term
selection, the intended use of the selected terms differs.
In monolingual systems, the selected terms will be
used directly for matching. The so-called “ranked
retrieval” systems that we use seek to place
documents that best match the query closest to the top
of a ranked list. For this reason, query terms that
are highly selective (i.e., that appear in only a few
documents) typically receive greater weight.^1 The
term matching stage, where weighted query terms are
matched with the terms found in the documents, is
then used to identify documents that best match the
query.
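As a minimal sketch of this kind of selectivity-based weighting, the Python fragment below computes one common logarithmic form of inverse document frequency over a toy collection. The formula log(N/df), the function name `idf_weights`, and the sample documents are assumptions made for illustration. They are not the weighting actually used inside the retrieval system in these experiments.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Return {term: log(N / df)} for a list of tokenized documents.

    N is the number of documents and df is the number of documents
    that contain the term at least once.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical toy collection: "retrieval" occurs in every document,
# "query" in two of them, and "kanji" in only one.
docs = [
    ["retrieval", "query", "kanji"],
    ["retrieval", "query"],
    ["retrieval", "document"],
    ["retrieval", "translation"],
]
weights = idf_weights(docs)
```

In this toy collection a term that occurs in every document receives a weight of zero, while the term that occurs in a single document receives the largest weight, which is the behavior the paragraph above describes.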

In cross-language retrieval using DQT, two term
selection stages are needed. The goal of the first is
to discover terms for which translations are known,
while the goal of the second is to select the best
translation(s) from among those that are known to
be possible. Some dictionaries present the most
common translation (in general usage) first, and in that
case a useful heuristic is to choose the first
translation (DQT-FT). In other cases, a more conservative
heuristic in which every translation is retained for
each term (DQT-ET) has proven to be useful. Since
detailed information about the development of a
particular dictionary can be difficult to obtain, we
routinely compare the two term choice strategies when
running DQT experiments.
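The two term choice strategies can be sketched as follows. This is an illustrative reconstruction rather than the DQT code used in the experiments: the function name `translate_query`, the toy term list, and the decision to silently drop terms that have no known translation are all assumptions.

```python
def translate_query(terms, dictionary, strategy="ET"):
    """Term-by-term translation of an already segmented query.

    `dictionary` maps a source term to its translations in dictionary
    order. Terms with no known translation are dropped (an assumption).
    """
    if strategy not in ("FT", "ET"):
        raise ValueError("strategy must be 'FT' or 'ET'")
    translated = []
    for term in terms:
        candidates = dictionary.get(term, [])
        if strategy == "FT":
            candidates = candidates[:1]  # first translation only
        translated.extend(candidates)    # every translation for "ET"
    return translated


# Hypothetical two-entry term list; the third query term is unknown.
toy_dict = {
    "情報": ["information", "news", "intelligence"],
    "検索": ["search", "retrieval"],
}
query = ["情報", "検索", "未知語"]
first_only = translate_query(query, toy_dict, "FT")
every = translate_query(query, toy_dict, "ET")
```

With this toy input, the first-translation strategy yields one target term per translatable source term, while the every-translation strategy yields a longer query that may contain less common senses.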


^1 This measure of selectivity is generally referred to as the
“inverse document frequency” (IDF) of a term. For reasons
of efficiency, it is more common to associate IDF weights with
every occurrence of a term in a document because the value
can be computed in advance. Associating IDF weights with
the query as we have done here sheds light on the interaction
between query translation and IDF weights without altering
the retrieval outcome.


Term weighting serves the same purpose in
cross-language retrieval: to give more emphasis to the
most useful terms. In experiments with automatically
segmented Chinese queries, we discovered that
assigning term weights based on the selectivity of a
translated term caused problems because segmentation
errors typically produced terms for which many
translations were known, and some of those
translations were rare (and hence highly selective)
English words [9]. Emphasizing selective terms is
helpful when weighting query terms that are provided
directly by the user, but our results with Chinese
clearly indicate that it can sometimes be dangerous
to apply it in the same way to translated terms.

3  Experiment  Design 

Each of our four query sets was formed by automatically
extracting one or more fields from the given topics.
The query set was then passed to JUMAN version 2.2 for
segmentation.^2 The first column of the output (the
component words) was then extracted and passed to
Dictionary-based Query Translation (DQT). The DQT code
requires a query set and a bilingual dictionary as
input and produces a query set with the translations
of each query word into the target language as output.
We used the freely available “edict” Japanese/English
dictionary, which contains 64,433 Japanese terms and a
total of 104,705 bilingual term pairs.^3 Some
preprocessing was done, including removal of
pronunciation information and (after our official
submission) removal of parenthetical clauses (which
are generally explanations rather than translations).
Our existing DQT code had to be modified to accommodate
multibyte characters; we did this by converting the
Japanese characters (in both the dictionary and the
query set) into the corresponding hexadecimal
representations.
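The dictionary preprocessing and hexadecimal conversion described above might look roughly like the following sketch. Several details are assumptions that the text does not confirm: the line layout (headword, bracketed reading, slash-separated translations), the sample entry, the EUC-JP encoding, and the helper names `parse_edict_line` and `to_hex`.

```python
import re


def parse_edict_line(line):
    """Split one dictionary line into (headword, [translations]).

    Assumed layout: HEADWORD [READING] /translation/translation/
    """
    head, _, rest = line.partition("/")
    headword = re.sub(r"\[.*?\]", "", head).strip()  # drop pronunciation
    translations = []
    for gloss in rest.split("/"):
        gloss = re.sub(r"\([^)]*\)", "", gloss)      # drop parentheticals
        gloss = " ".join(gloss.split())              # tidy whitespace
        if gloss:
            translations.append(gloss)
    return headword, translations


def to_hex(term, encoding="euc_jp"):
    """Represent a multibyte term as an ASCII hexadecimal string."""
    return term.encode(encoding).hex()


# Hypothetical entry written in the assumed layout.
sample = "検索 [けんさく] /(n,vs) looking up (e.g., a word in a dictionary)/retrieval/"
headword, translations = parse_edict_line(sample)
key = to_hex(headword)
```

Because the hexadecimal key contains only ASCII characters, it can be compared and stored by code that was written for single-byte text, provided the same conversion is applied to the query terms.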

^2 We happened to have an installed copy of JUMAN 2.2
available, and our inability to read the Japanese
documentation for JUMAN prevented us from installing a more recent
version in time for these experiments. JUMAN 3.61 is available
at http://pine.kuee.kyoto-u.ac.jp/nl-resource/juman.html

^3 The edict dictionary is freely available in electronic form
from Monash University.

The translated queries were passed to version 3.1p1
of the Inquery information retrieval system, which we
obtained from the University of Massachusetts [4].
Inquery is a probabilistic retrieval system based on
Bayesian inference networks. In our experiment, we
used the “sum” operator to form queries. The sum
operator calculates the value of the belief that a
query is satisfied by a document as the mean of the
beliefs associated with each query term. The Inquery
“kstem” stemmer and the standard English Inquery
stopword list were used when indexing the English
document collection.
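The sum operator described above reduces to a mean over per-term beliefs. The sketch below illustrates only that combination step. The function names and belief values are hypothetical, and it does not reproduce how Inquery itself computes the belief for an individual term.

```python
def sum_belief(term_beliefs):
    """Combine per-term beliefs as their arithmetic mean."""
    if not term_beliefs:
        raise ValueError("query has no terms")
    return sum(term_beliefs) / len(term_beliefs)


def rank(doc_beliefs):
    """Order document ids by decreasing combined belief."""
    scores = {doc: sum_belief(b) for doc, b in doc_beliefs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical per-term beliefs for a three-term query.
beliefs = {
    "doc_a": [0.4, 0.4, 0.7],
    "doc_b": [0.9, 0.4, 0.5],
}
ranking = rank(beliefs)
```

Because the mean gives every query term equal influence, a translated query that contains many alternative translations of one source term lets that source term contribute more than once to the combined belief.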

4  Results 

After submitting the two official runs, we discovered
that we had inadvertently omitted 10 of the
scored topics from the run in which we used the
NARRATIVE field to form the queries (umd2). We have
corrected this mistake in the experiments reported
here. We also performed the dictionary cleanup
described above between our official results and the ones
reported here. In all, we made eight runs for this
paper:

•  DFT  Queries formed with the DESCRIPTION
field and translated with DQT-FT (submitted
officially as umd1).

•  DET  Queries formed with the DESCRIPTION
field and translated with DQT-ET.

•  JFT  Queries formed with the J.CONCEPT field
and translated with DQT-FT.

•  JET  Queries formed with the J.CONCEPT field
and translated with DQT-ET.

•  NFT  Queries formed with the NARRATIVE
field and translated with DQT-FT (submitted
officially as umd2).

•  NET  Queries formed with the NARRATIVE
field and translated with DQT-ET.

•  TNJDFT  Queries formed with the TITLE,
NARRATIVE, J.CONCEPT and DESCRIPTION
fields and translated with DQT-FT.

•  TNJDET  Queries formed with the TITLE,
NARRATIVE, J.CONCEPT and DESCRIPTION
fields and translated with DQT-ET.

Non-interpolated average precision values for these
eight runs are shown in Table 1, and Figures 2 and
3 show the 11-point recall-precision graphs for
DQT-FT and DQT-ET respectively. By these measures, we
achieved the best overall retrieval effectiveness by
using DQT-FT with queries formed from all four topic
fields. The insignificant change in DFT between our
official submission and these results (from 0.0788 to
0.0791) is due solely to dictionary cleanup. The
inclusion of the previously omitted queries is thus the
obvious explanation for the dramatic increase in NFT
between our official submission and these results (from
0.0968 to 0.1204).
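For readers unfamiliar with the measure, non-interpolated average precision for a single query can be sketched as below. The function name, document identifiers, and relevance judgments are hypothetical, and the sketch assumes that the ranked list contains no duplicate documents. The value reported for a run is then the mean of this quantity over all queries.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one query.

    Precision is taken at the rank of each relevant document that is
    retrieved, and the sum is divided by the total number of relevant
    documents, so relevant documents that are never retrieved add zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Hypothetical ranked list and relevance judgments.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d3", "d2", "d8"}
ap = average_precision(ranked, relevant)
```

In this example the two retrieved relevant documents contribute precisions of 1/1 and 2/4, the unretrieved one contributes nothing, and the result is (1.0 + 0.5) / 3 = 0.5.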


DQT     Topic Fields
        D        J        N        TNJD
ET      0.0704   0.0981   0.0996   0.1337
FT      0.0791   0.1056   0.1204   0.1534

Table 1: Non-interpolated average precision (D=DESCRIPTION,
J=J.CONCEPT, N=NARRATIVE, T=TITLE).


Figure 2: Precision-recall curves with DQT-FT.

We tested our results for statistical significance
using paired sample t-tests. The significance values for
pairwise comparisons between topic sets when DQT-FT
was used are shown in Table 2. Values below
0.05 are generally accepted as significant in studies of
this type [6]. In this test, the 39 queries are taken
as random samples from a query population, the
11-point average precision for each query is the
dependent variable, and the DQT technique and the query
set are the independent variables. We found that
long queries often outperform short queries. For
example, queries formed with all four fields (TNJDFT
and TNJDET) perform significantly better than all
six of the other query sets. Queries with the
NARRATIVE field also significantly outperform the required
queries that used only the DESCRIPTION field.
However, we did not observe statistically significant
differences (at the 0.05 level) between queries with the
DESCRIPTION or NARRATIVE fields and queries
with the J.CONCEPT field.
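A paired sample t-test of the kind described above compares two runs query by query. The sketch below computes only the test statistic and its degrees of freedom using the standard library. The function name and the per-query scores are hypothetical, and obtaining the significance value additionally requires the t distribution, for which a statistics package would normally be used.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired sample t-test."""
    if len(scores_a) != len(scores_b):
        raise ValueError("samples must be paired")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean = sum(diffs) / n
    # Sample variance of the per-query differences.
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("all differences are identical")
    t = mean / math.sqrt(variance / n)
    return t, n - 1


# Hypothetical per-query average precision for two query sets.
long_queries = [0.30, 0.10, 0.25, 0.40, 0.15]
short_queries = [0.20, 0.05, 0.20, 0.25, 0.10]
t, dof = paired_t_statistic(long_queries, short_queries)
```

Pairing by query removes the large variation in difficulty between topics, so the test asks only whether the per-query differences are consistently in one direction.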

Figure 3: Precision-recall curves with DQT-ET.

Table 2: Paired sample t-test significance values for
DQT-FT.

As Figure 4 illustrates, the results for DQT-FT and
DQT-ET were quite similar when averaged over all
queries. Statistical significance tests failed to detect
a significant difference between DQT-FT and DQT-ET
for any of the four query sets. The query-by-query
comparison in Figure 5 provides some additional
insight, showing that DQT-FT noticeably
outperformed DQT-ET on some queries, but noticeably
underperformed it on others.

We explored the interaction between segmentation
and translation by examining some of the original,
segmented and translated queries. Although
Japanese in written form is similar in some ways to
Chinese, it does have unique characteristics. Chinese
texts are mainly composed of hanzi characters, while
Japanese texts are composed of kanji, hiragana, and
katakana. A character set change provides a reliable
cue for term segmentation, so segmentation is
inherently easier for Japanese than for Chinese.
Furthermore, hiragana, which is common in the queries we
examined, often represents function words that are
of little use with bag-of-words retrieval techniques.
There are few English translations for hiragana in
the edict dictionary, so even if the segmenter
missegments hiragana, edict would be unlikely to propagate
the error through translation. Together, these factors
might explain why the cascading errors observed in
Chinese were not present in these experiments.

Figure 4: DQT-FT vs. DQT-ET.

Figure 5: Query-by-query comparison of DQT-FT
and DQT-ET.

5  Conclusion 

We have tested Japanese/English cross-language
information retrieval with automatic dictionary-based
query translation. The results reveal that long
queries often outperform shorter ones, but that our
two query translation techniques perform comparably.
Japanese term segmentation does not appear to
pose problems that are as severe as those that we
have encountered with CLIR between Chinese and
English. The existence of multiple character types
in Japanese seems to be the fundamental reason for
this. In future work we plan to explore additional
cross-language retrieval techniques in the context of
Asian languages, perhaps including the application of
word sense disambiguation approaches such as those
studied by Ballesteros and Croft [2].

This first NTCIR evaluation has provided us with
valuable experience that has helped us to deepen our
understanding of critical issues for cross-language
information retrieval using Asian languages. We
expect that the test collection will prove to be a
valuable legacy, now permitting a broader range of
experiments than has previously been possible.